Shape of the Sampling Distribution

The shape of the sampling distribution depends on which of the following two cases applies.

  1. The population from which samples are drawn has a normal distribution.
  2. The population from which samples are drawn does not have a normal distribution.

Sampling from a Normally Distributed Population

When the population from which samples are drawn is normally distributed with its mean equal to $\mu$ and standard deviation equal to $\sigma$, then:

  1. The mean of the sample means, $\mu_{\bar x}$, is equal to the mean of the population, $\mu$.
  2. The standard deviation of the sample means, $\sigma_{\bar x}$, is equal to $\frac{\sigma}{\sqrt{n}}$, assuming $\frac{n}{N} \le 0.05$.
  3. The shape of the sampling distribution of the sample means $(\bar x)$ is normal, whatever the value of $n$.
In [2]:
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
n1 = 5
n2 = 15
n3 = 30
n4 = 50
sigma = 1
mu = 0

Let us consider a normally distributed population. For the sake of simplicity we use the standard normal distribution, $X \sim N(\mu, \sigma)$, with $\mu = 0$ and $\sigma = 1$. Let us further calculate $\mu_{\bar x}$ and $\sigma_{\bar x}$ for samples of size $n = 5, 15, 30, 50$.

Recall that for a large enough number of repeated samples $\mu_{\bar x} \approx \mu$. Thus, $\mu_{\bar x}$ is the same for all of the sampling distributions under consideration:

$$\mu_{\bar x_{n=5}} = \mu_{\bar x_{n=15}} = \mu_{\bar x_{n=30}} = \mu_{\bar x_{n=50}} = \mu = 0$$

Recall that the standard error of the sampling distribution is $\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$. Thus, we can easily compute $\sigma_{\bar x}$ for samples of $n = 5, 15, 30, 50$ elements. The different sampling distributions are visualized afterwards.

$$\sigma_{\bar x_{n=5}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{5}} \approx 0.447$$

$$\sigma_{\bar x_{n=15}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{15}} \approx 0.258$$

$$\sigma_{\bar x_{n=30}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{30}} \approx 0.183$$

$$\sigma_{\bar x_{n=50}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{50}} \approx 0.141$$
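The same standard errors can be reproduced numerically. The following is a minimal sketch, assuming $\sigma = 1$ as in the text; the names `sample_sizes` and `standard_errors` are chosen here for illustration only.

```python
import numpy as np

sigma = 1.0                     # population standard deviation (from the text)
sample_sizes = [5, 15, 30, 50]  # sample sizes considered in the text

# Standard error of the mean for each sample size: sigma / sqrt(n)
standard_errors = {n: sigma / np.sqrt(n) for n in sample_sizes}

for n, se in standard_errors.items():
    print(f"n = {n:2d}: sigma_x_bar = {se:.3f}")
```

The printed values should agree with the hand calculations above, rounded to three decimals.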
In [4]:
# Grid of x-values on which the densities are evaluated
seq = np.arange(-4, 4.01, 0.001)
n = [5, 15, 30, 50]
color = ["blue", "orange", "green", "red"]


plt.figure(figsize=(10, 7))
# Population distribution N(mu, sigma)
plt.plot(
    seq,
    stats.norm.pdf(seq, mu, sigma),
    color="black",
    linewidth=2,
    label="Population distribution",
)
# Sampling distribution of the sample mean for each sample size
for i, c in zip(n, color):
    plt.plot(
        seq,
        stats.norm.pdf(seq, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label="$\\bar{x}$ for n = " + f"{i}",
    )

# Mark the common mean and set up the axes once, outside the loop
plt.vlines(mu, 0, 3, color="darkgrey", linestyle="dashed")
plt.yticks([])
plt.ylim(0, 3)

plt.legend()
plt.text(0.1, 2.5, r"$\mu_{\bar{x}} = \mu$", fontsize=16)

plt.show()

There are two important observations regarding the sampling distribution of $\bar x$:

  1. The spread of the sampling distribution is smaller than the spread of the corresponding population distribution. In other words, $\sigma_{\bar x} < \sigma$.
  2. The standard deviation of the sampling distribution decreases as the sample size increases.
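Both observations can also be checked numerically. The sketch below is an illustration, not part of the original notebook: it assumes a standard normal population, an arbitrary seed, and 10,000 repeated samples per sample size, and it compares the empirical standard deviation of the sample means with $\frac{\sigma}{\sqrt{n}}$.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
mu, sigma = 0.0, 1.0             # standard normal population (from the text)
trials = 10_000                  # number of repeated samples (an assumption)

empirical_sd = {}
for n in [5, 15, 30, 50]:
    # Draw `trials` samples of size n and compute the mean of each sample
    samples = rng.normal(mu, sigma, size=(trials, n))
    sample_means = samples.mean(axis=1)
    # Spread of the sample means, to be compared with sigma / sqrt(n)
    empirical_sd[n] = sample_means.std(ddof=1)
    print(
        f"n = {n:2d}: empirical = {empirical_sd[n]:.3f}, "
        f"theoretical = {sigma / np.sqrt(n):.3f}"
    )
```

For every sample size the empirical value should lie below $\sigma = 1$ and close to the theoretical standard error, and it should shrink as $n$ grows.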

In order to verify the 3rd claim from above, that the shape of the sampling distribution of $\bar x$ is normal, whatever the value of $n$, we conduct a computational experiment. We repeat the sampling a large number of times (trials = 1000): each time we draw a sample from the standard normal distribution $X \sim N(\mu = 0, \sigma = 1)$, where each particular sample has a sample size of $n = 5, 15, 30, 50$. For each sample we calculate the sample mean $\bar x$ and visualize the empirical probabilities. Afterwards we compare the empirical distribution of those probabilities with the sampling distributions calculated from the equations above.

In [5]:
trials = 1000
n = [5, 15, 30, 50]
mu = 0
sigma = 1


x = np.arange(-4, 4.01, 0.001)
color = ["blue", "orange", "green", "red"]

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
fig.suptitle("Relative frequency distribution (occurrences) of $\\bar{x}$", fontsize=20)

for i, ax, c in zip(n, axes.ravel(), color):
    # Theoretical sampling distribution N(mu, sigma / sqrt(n))
    ax.plot(
        x,
        stats.norm.pdf(x, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label="$\\bar{x}$ for n = " + f"{i}",
    )
    # Draw `trials` samples of size i from the population and average each one
    sample_means = stats.norm.rvs(mu, sigma, size=(trials, i)).mean(axis=1)
    ax.hist(
        sample_means,
        density=True,
        color="lightgrey",
        edgecolor="darkgrey",
    )
    ax.set_ylim(-0.1, 3)
    ax.set_xlim(-2, 2)
    ax.set_ylabel("Density")
    ax.title.set_text(
        f"Empirical Probabilities vs.\nSampling Distribution for sample size n={i}"
    )

plt.tight_layout()
plt.show()

The figure supports the 3rd claim from above: The shape of the sampling distribution of $\bar x$ is normal, whatever the value of $n$.

In addition, the figure shows that the distribution of the empirical probabilities (bars) fits the sampling distribution (colored line) well, and that the standard deviation of the sampling distribution of $\bar x$ decreases as the sample size increases. Recall that the y-axis represents the density, which is the probability per unit value of the random variable. This is why the probability density can take a value greater than 1, but only over a region with measure less than 1.
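The point about densities greater than 1 can be illustrated with a short calculation. The sketch below is an illustration under the assumptions of this section ($\mu = 0$, $\sigma = 1$, $n = 50$); the grid spacing `dx` and the simple Riemann sum are choices made here, not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

sigma, n = 1.0, 50
se = sigma / np.sqrt(n)  # standard error of the mean for n = 50

# Evaluate the density of the sampling distribution on a fine grid
dx = 0.001
x = np.arange(-4, 4 + dx, dx)
pdf = stats.norm.pdf(x, loc=0, scale=se)

peak = pdf.max()         # highest value of the density curve
area = np.sum(pdf) * dx  # Riemann-sum approximation of the total area

print(f"peak density = {peak:.3f}")
print(f"total area   = {area:.3f}")
```

The peak of the curve exceeds 1, while the area under the whole curve, which is the total probability, stays at approximately 1.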


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.